Predictive Policing - Technical Implementation

Exploring Burglary Patterns in Chicago and Predictive Modeling with 311 Data

Author

Yuqing (Demi) Yang

Published

December 9, 2025

Assignment Overview

In this lab, you will apply the spatial predictive modeling techniques demonstrated in the class exercise to a 311 service request type of your choice. You will build a complete spatial predictive model, document your process, and interpret your results.

Timeline & Deliverables

Due Date: November 17, 2025, 10:00AM

Deliverable: One rendered document, posted to your portfolio website.

Learning Goals

By completing this assignment, you will demonstrate your ability to:

  • Adapt example code to analyze a new dataset
  • Build spatial features for predictive modeling
  • Apply count regression techniques to spatial data
  • Implement spatial cross-validation
  • Interpret and communicate model results
  • Critically evaluate model performance

Step 0: Set Up

Code
# Load required packages
library(tidyverse)      # Data manipulation
library(sf)             # Spatial operations
library(here)           # Relative file paths
library(viridis)        # Color scales
library(terra)          # Raster operations (replaces 'raster')
library(spdep)          # Spatial dependence
library(FNN)            # Fast nearest neighbors
library(MASS)           # Negative binomial regression
library(patchwork)      # Plot composition (replaces grid/gridExtra)
library(knitr)          # Tables
library(kableExtra)     # Table formatting
library(classInt)       # Classification intervals

# Spatstat split into sub-packages
library(spatstat.geom)    # Spatial geometries
library(spatstat.explore) # Spatial exploration/KDE

# Set options
options(scipen = 999)  # No scientific notation
set.seed(5080)         # Reproducibility

# Create consistent theme for visualizations
theme_default <- function(base_size = 11) {
  theme_minimal(base_size = base_size) +
    theme(
      plot.title = element_text(face = "bold", size = base_size + 1),
      plot.subtitle = element_text(color = "gray30", size = base_size - 1),
      legend.position = "right",
      panel.grid.minor = element_blank(),
      axis.text = element_blank(),
      axis.title = element_blank()
    )
}

# Set as default
theme_set(theme_default())

cat("✓ All packages loaded successfully!\n")
✓ All packages loaded successfully!

Step 1: Getting the Data

1.1 Load Chicago Spatial Data

Code
# Load police districts (used for spatial cross-validation)
policeDistricts <- 
  st_read("https://data.cityofchicago.org/api/geospatial/24zt-jpfn?method=export&format=GeoJSON") %>%
  st_transform('ESRI:102271') %>%
  dplyr::select(District = dist_num)

# Load police beats (smaller administrative units)
policeBeats <- 
  st_read("https://data.cityofchicago.org/api/geospatial/n9it-hstw?method=export&format=GeoJSON") %>%
  st_transform('ESRI:102271') %>%
  dplyr::select(Beat = beat_num)

# Load Chicago boundary
chicagoBoundary <- 
  st_read("https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/DATA/Chapter5/chicagoBoundary.geojson") %>%
  st_transform('ESRI:102271')

cat("  - Police districts:", nrow(policeDistricts), "\n")
cat("  - Police beats:", nrow(policeBeats), "\n")

1.2 Load 311 Rodent Baiting Data

Code
rb <- read_csv("data/311_Service_Requests_-_Rodent_Baiting_-_Historical_20251114.csv")
head(rb)

rb_sf <- rb %>%
  filter(!is.na(Latitude), !is.na(Longitude)) %>% 
  st_as_sf(coords = c("Longitude", "Latitude"), crs = 4326) %>% 
  st_transform("ESRI:102271")  

# Keep only points inside the city boundary
# (lengths() > 0 avoids passing a logical matrix to filter())
rb_sf <- rb_sf %>% 
  filter(lengths(st_within(., chicagoBoundary)) > 0)

# filter 2017 data
rb_sf$CreationDate <- as.Date(rb_sf$`Creation Date`, format = "%m/%d/%Y")
rb_sf_2017 <- rb_sf %>% 
  filter(format(CreationDate, "%Y") == "2017")
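As a quick sanity check, the parse-and-filter-by-year pattern above can be exercised in isolation on toy dates (base R only; the vector below is illustrative, not from the 311 data):

```r
# Toy check of the "%m/%d/%Y" parse-and-filter-by-year pattern used above
dates_raw <- c("01/15/2017", "12/31/2016", "06/01/2017", "03/10/2018")
dates     <- as.Date(dates_raw, format = "%m/%d/%Y")

keep_2017 <- dates[format(dates, "%Y") == "2017"]
length(keep_2017)  # 2 of the 4 toy dates fall in 2017
```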

1.3 Load Burglary Data

Code
# Load from provided data file (downloaded from Chicago open data portal)
burglaries <- st_read("data/burglaries.shp") %>% 
  st_transform('ESRI:102271')

# Check the data
cat("  - Number of burglaries:", nrow(burglaries), "\n")
cat("  - CRS:", st_crs(burglaries)$input, "\n")

1.4 Visualize Point Data

1.4.1 Rodent Baiting Data

Code
# Simple point map for Rodent Baiting
p1 <- ggplot() + 
  geom_sf(data = chicagoBoundary, fill = "gray95", color = "gray60") +
  geom_sf(data = rb_sf_2017, color = "#d62828", size = 0.1, alpha = 0.4) +
  labs(
    title = "Rodent Baiting 311 Service Requests",
    subtitle = paste0("Chicago, n = ", nrow(rb_sf_2017))
  )

# Density surface using 311 Rodent Baiting data
p2 <- ggplot() + 
  geom_sf(data = chicagoBoundary, fill = "gray95", color = "gray60") +
  geom_density_2d_filled(
    data = data.frame(st_coordinates(rb_sf_2017)),
    aes(X, Y),
    alpha = 0.7,
    bins = 8
  ) +
  scale_fill_viridis_d(
    option = "plasma",
    direction = -1,
    guide = "none"
  ) +
  labs(
    title = "Density Surface",
    subtitle = "Kernel density estimation of Rodent Baiting Requests"
  )

# Combine plots using patchwork
p1 + p2 + 
  plot_annotation(
    title = "Spatial Distribution of Rodent Baiting 311 Requests in Chicago",
    tag_levels = "A"
  )

  • The maps above show the spatial distribution of Rodent Baiting 311 requests across Chicago.
  • The point map on the left illustrates the raw locations of all reported rodent issues. The large number of points makes the overall pattern dense, but we can still see clear clusters in the northern and western parts of the city.
  • The density map on the right provides a smoother view of these patterns. Several strong hotspots emerge, especially in the Northwest Side and Near North areas. In contrast, the southern and far southwestern parts of Chicago show much lower intensity.

1.4.2 Burglaries Data

Code
# Simple point map for Burglaries (P3)
p3 <- ggplot() + 
  geom_sf(data = chicagoBoundary, fill = "gray95", color = "gray60") +
  geom_sf(data = burglaries, color = "#003f5c", size = 0.15, alpha = 0.5) +
  labs(
    title = "Residential Burglaries",
    subtitle = paste0("Chicago, n = ", nrow(burglaries))
  )

# Density surface for Burglaries (P4)
p4 <- ggplot() + 
  geom_sf(data = chicagoBoundary, fill = "gray95", color = "gray60") +
  geom_density_2d_filled(
    data = data.frame(st_coordinates(burglaries)),
    aes(X, Y),
    alpha = 0.7,
    bins = 8
  ) +
  scale_fill_viridis_d(
    option = "plasma",
    direction = -1,
    guide = "none"
  ) +
  labs(
    title = "Density Surface",
    subtitle = "Kernel density estimation of Residential Burglaries"
  )

# Combine P3 & P4
p3 + p4 +
  plot_annotation(
    title = "Spatial Distribution of Residential Burglaries in Chicago",
    tag_levels = "A"
  )

  • The maps show where residential burglaries were concentrated in Chicago.
  • The darker purple areas represent the highest concentrations of burglaries. These hot spots appear mainly in the north and south-central parts of the city.
  • Overall, the maps suggest that residential burglaries cluster in specific neighborhoods rather than occurring randomly across Chicago.

1.5 Chapter Summary

  • Load maps and boundaries of Chicago:
    • This chapter loads police districts, police beats, and the city boundary.
    • These layers help define the study area and later allow spatial analysis.
    • All layers are transformed to the same coordinate system so they line up correctly.
  • Load 311 Rodent Baiting data:
    • This chapter loads Chicago 311 Rodent Baiting data, and keeps only records from 2017.
  • Load burglary data:
    • A shapefile of burglary cases is loaded.
    • It is also transformed into the same coordinate system.
    • This dataset will be compared with the rodent data later.
  • Purpose of Step 1:
    • Gather most of the data needed for the project.
    • Clean and filter the data so it is accurate and consistent.
    • Make basic maps to understand where events happen.

Step 2: Fishnet Grid Creation

2.1 Create Fishnet Grid

Code
# Create 500m x 500m grid
fishnet <- st_make_grid(
  chicagoBoundary,
  cellsize = 500,  # 500 meters per cell
  square = TRUE
) %>%
  st_sf() %>%
  mutate(uniqueID = row_number())

# Keep only cells that intersect Chicago
fishnet <- fishnet[chicagoBoundary, ]

# View basic info
cat("  - Number of cells:", nrow(fishnet), "\n")
  - Number of cells: 2458 
Code
cat("  - Cell size:", 500, "x", 500, "meters\n")
  - Cell size: 500 x 500 meters
Code
cat("  - Cell area:", round(st_area(fishnet[1,])), "square meters\n")
  - Cell area: 250000 square meters

2.2 Aggregate data to Grid

Code
# Spatial join: which cell contains each rodent baiting request/burglaries?

rodent_fishnet <- st_join(rb_sf_2017, fishnet, join = st_within) %>%
  st_drop_geometry() %>%
  group_by(uniqueID) %>%
  summarise(countRodent = n(), .groups = "drop")

burglary_fishnet <- st_join(burglaries, fishnet, join = st_within) %>%
  st_drop_geometry() %>%
  group_by(uniqueID) %>%
  summarise(countBurglaries = n(), .groups = "drop")

# Join back to fishnet (cells with 0 requests will be NA)

fishnet <- fishnet %>%
  left_join(rodent_fishnet,   by = "uniqueID") %>%
  left_join(burglary_fishnet, by = "uniqueID") %>%
  mutate(
    countRodent     = tidyr::replace_na(countRodent, 0),
    countBurglaries = tidyr::replace_na(countBurglaries, 0)
  )
  • Summary statistics for Rodent Baiting requests
Code
print(summary(fishnet$countRodent))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00    2.00   11.00   20.68   30.00  180.00 
Code
cat("\nCells with zero rodent requests:",
    sum(fishnet$countRodent == 0), "/",
    nrow(fishnet), "(",
    round(100 * sum(fishnet$countRodent == 0) / nrow(fishnet), 1), "% )\n")

Cells with zero rodent requests: 500 / 2458 ( 20.3 % )
  • Summary statistics for Burglary
Code
print(summary(fishnet$countBurglaries))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.000   0.000   2.000   3.042   5.000  40.000 
Code
cat("\nCells with zero burglaries:",
    sum(fishnet$countBurglaries == 0), "/",
    nrow(fishnet), "(",
    round(100 * sum(fishnet$countBurglaries == 0) / nrow(fishnet), 1), "% )\n")

Cells with zero burglaries: 781 / 2458 ( 31.8 % )
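One useful invariant for the aggregate-and-zero-fill step is that no events are lost: the cell counts should sum back to the original number of points. A base-R sketch of that check (toy cell IDs, not the real fishnet):

```r
# 5 toy grid cells; 6 events landing in cells 1, 1, 2, 2, 2, 5
cell_ids <- 1:5
events   <- c(1, 1, 2, 2, 2, 5)

# table() over a factor with all levels zero-fills the empty cells,
# mirroring the left_join() + replace_na(0) pattern above
counts <- as.integer(table(factor(events, levels = cell_ids)))

counts                         # 2 3 0 0 1
sum(counts) == length(events)  # TRUE: no events lost in aggregation
```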

2.3 Visualize Rodent Baiting with fishnet

Code
ggplot() +
  geom_sf(data = fishnet, aes(fill = countRodent), color = NA) +
  geom_sf(data = chicagoBoundary, fill = NA, color = "white", linewidth = 1) +
  scale_fill_viridis_c(
    name   = "Rodent Baiting Requests",
    option = "plasma",
    trans  = "sqrt",
    breaks = c(0, 5, 10, 20, 40, 80, 180),
    limits = c(0, 180), 
    oob    = scales::squish
  ) +
  labs(
    title = "Rodent Baiting 311 Requests by Grid Cell",
    subtitle = "500m x 500m cells, Chicago, 2017"
  ) +
  theme_default()

Code
quant_rb <- tibble(
  Percentile = c("0%", "10%", "20%", "30%", "40%", "50%", "60%", "70%", "80%", "90%", "100%"),
  Value = quantile(fishnet$countRodent, probs = seq(0,1,0.1))
)

knitr::kable(
  quant_rb,
  caption = "Quantile Distribution of Rodent Baiting 311 Request Counts"
)
Quantile Distribution of Rodent Baiting 311 Request Counts
Percentile Value
0% 0
10% 0
20% 0
30% 3
40% 7
50% 11
60% 17
70% 24
80% 36
90% 56
100% 180
  • Most grid cells have very low request counts. According to the quantile table, 30% of the cells have 3 or fewer requests, and 50% have 11 or fewer.
  • This pattern shows a strong right-skewed distribution, where a small number of hotspots generate many more requests than the rest of the city.
  • On the map, hotspots appear mainly in the central and north neighborhoods, forming clear clusters of higher counts, and suggesting strong spatial heterogeneity.

2.4 Visualize Burglaries with fishnet

Code
# Burglary 500m x 500m grid map
ggplot() +
  geom_sf(data = fishnet, aes(fill = countBurglaries), color = NA) +
  geom_sf(data = chicagoBoundary, fill = NA, color = "white", linewidth = 1) +
  scale_fill_viridis_c(
    name   = "Residential Burglaries",
    option = "plasma",
    trans  = "sqrt",
    breaks = c(0, 1, 2, 3, 5, 8, 40),
    limits = c(0, 40),
    oob    = scales::squish
  ) +
  labs(
    title = "Residential Burglaries by Grid Cell",
    subtitle = "500m x 500m cells, Chicago, 2017"
  ) +
  theme_default()

Code
quant_burglary <- tibble(
  Percentile = c("0%", "10%", "20%", "30%", "40%", "50%", "60%", "70%", "80%", "90%", "100%"),
  Value = quantile(fishnet$countBurglaries, probs = seq(0, 1, 0.1), na.rm = TRUE)
)

knitr::kable(
  quant_burglary,
  caption = "Quantile Distribution of Residential Burglary Counts"
)
Quantile Distribution of Residential Burglary Counts
Percentile Value
0% 0
10% 0
20% 0
30% 0
40% 1
50% 2
60% 3
70% 4
80% 5
90% 8
100% 40
  • Most grid cells have very few burglaries. From the quantiles, 50% of all cells have 2 or fewer burglary cases.
  • The distribution is right-skewed. Many places have low values (0–3), while only a few places have high values.
  • On the map, brighter colors (yellow) show where burglaries are concentrated. These hotspots appear mainly in the northern and southern parts of Chicago.
  • The pattern suggests burglary risk is not uniform across the city. Instead, it clusters in specific neighborhoods, which is important for modeling and prediction.

2.5 Chapter Summary

  • Creating the Fishnet Grid:
    • This chapter created a 500m × 500m grid that covers the whole Chicago boundary.
    • Each grid cell gets a unique ID, which helps match spatial features later.
    • The grid makes it easier to compare different locations using the same unit of space.
  • Aggregating Rodent and Burglary Data:
    • Each 311 rodent request and burglary point is assigned to a grid cell using a spatial join.
    • Then, this chapter counted how many events happen in each cell.
    • Cells with no events are filled with zero, so the dataset stays complete.
    • This step converts point data into count data, which is needed for statistical modeling.
  • Mapping Rodent Baiting Counts:
    • The map shows clear hotspots where rodent activity is high.
    • The distribution is very skewed, meaning most cells have low values, but a few cells have extremely high values.
    • This pattern suggests strong spatial clustering.
  • Mapping Burglary Counts:
    • Burglary counts also show uneven spatial distribution.
    • Many cells have 0–3 burglaries, while only a small number of cells have high counts.
    • This map confirms that burglaries are not random but cluster in specific neighborhoods.
  • Overall Importance of Step 2:
    • This chapter transforms raw point data into a structured spatial dataset.
    • It creates the base layer (the grid) where all future variables will be attached.
    • It also reveals important spatial patterns that guide later modeling steps.

Step 3: Spatial Features

3.1 k-nearest neighbor

Code
# 1. Get centroid coordinates
fishnet_coords <- st_coordinates(st_centroid(fishnet))

# 2. Compute nearest 8 neighbors (to other grid cells)
knn_result <- get.knn(fishnet_coords, k = 8)

# 3. Add KNN features to fishnet
fishnet <- fishnet %>%
  mutate(
    knn_mean = rowMeans(matrix(countRodent[knn_result$nn.index], nrow = nrow(fishnet))),
    knn_sum  = rowSums(matrix(countRodent[knn_result$nn.index],  nrow = nrow(fishnet)))
  )

summary(fishnet$knn_mean)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.000   6.281  15.562  20.992  31.500 112.125 
Code
summary(fishnet$knn_sum)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00   50.25  124.50  167.94  252.00  897.00 
  • Instead of measuring the distance to the nearest event points, I calculated the sum and mean of counts in the neighboring grid cells. This approach better captures spatial spillover between neighborhoods, providing a stronger and more reliable predictor.
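The neighbor-mean idea can be illustrated without FNN, using base R `dist()` on a few toy centroids (the coordinates and counts below are made up for illustration):

```r
# Toy centroids and counts; knn_mean for cell i = mean count of its k nearest cells
coords <- cbind(x = c(0, 1, 0, 5), y = c(0, 0, 1, 5))
counts <- c(10, 2, 4, 100)
k <- 2

d <- as.matrix(dist(coords))        # pairwise Euclidean distances
diag(d) <- Inf                      # a cell is not its own neighbor
knn_mean <- sapply(seq_len(nrow(d)), function(i) {
  nb <- order(d[i, ])[1:k]          # indices of the k nearest cells
  mean(counts[nb])
})
knn_mean  # 3 7 6 3
```

The isolated high-count cell (count = 100) does not inflate its own feature; its neighbors' low counts do, which is exactly the spillover signal the model needs.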

3.2 Distance to Hot Spots

3.2.1 Perform Local Moran’s I analysis

Code
# Function to calculate Local Moran's I
calculate_local_morans <- function(data, variable, k = 5) {
  
  # Create spatial weights based on k-nearest neighbors
  coords <- st_coordinates(st_centroid(data))
  neighbors <- knn2nb(knearneigh(coords, k = k))
  weights <- nb2listw(neighbors, style = "W", zero.policy = TRUE)
  
  # Calculate Local Moran's I
  local_moran <- localmoran(data[[variable]], weights)
  
  # Classify clusters
  mean_val <- mean(data[[variable]], na.rm = TRUE)
  
  data %>%
    mutate(
      local_i = local_moran[, 1],
      p_value = local_moran[, 5],
      is_significant = p_value < 0.05,
      
      moran_class = case_when(
        !is_significant ~ "Not Significant",
        local_i > 0 & .data[[variable]] > mean_val ~ "High-High",
        local_i > 0 & .data[[variable]] <= mean_val ~ "Low-Low",
        local_i < 0 & .data[[variable]] > mean_val ~ "High-Low",
        local_i < 0 & .data[[variable]] <= mean_val ~ "Low-High",
        TRUE ~ "Not Significant"
      )
    )
}

# Apply to rodent baiting counts
fishnet <- calculate_local_morans(fishnet, "countRodent", k = 5)

3.2.2 Visualize Morans

Code
#| fig-width: 8
#| fig-height: 6

# Visualize Local Moran's I clusters for rodent baiting
ggplot() +
  geom_sf(
    data = fishnet, 
    aes(fill = moran_class), 
    color = NA
  ) +
  scale_fill_manual(
    values = c(
      "High-High"       = "#d7191c",  # hotspots
      "High-Low"        = "#fdae61",
      "Low-High"        = "#abd9e9",
      "Low-Low"         = "#2c7bb6",  # cold spots
      "Not Significant" = "gray90"
    ),
    name = "Cluster Type"
  ) +
  labs(
    title = "Local Moran's I: Rodent Baiting Clusters",
    subtitle = "High-High = Hot spots of rodent activity"
  ) +
  theme_default()

  • Why are there no “Low-Low” or “High-Low” clusters?
    • The data are highly right-skewed.
    • When high-value areas cluster together, they are very likely to be identified as High-High (hotspots).
    • Low-value areas are everywhere, so “low next to low” is too common to stand out in the significance test. As a result, those cells are marked “Not Significant” rather than “Low-Low”.
    • In reality, rodent activity is concentrated in a small number of hotspots, while low counts are widespread across the city.
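For intuition about the statistic that localmoran() decomposes, here is a hand-rolled global Moran's I on a toy one-dimensional arrangement, assuming adjacent-cell neighbors and row-standardized weights (the "W" style used above):

```r
# Clustered values: three high cells followed by three low cells
x <- c(10, 12, 11, 1, 0, 2)
n <- length(x)

# Adjacency weights for a 1-D chain of cells, then row-standardized
W <- matrix(0, n, n)
for (i in 1:(n - 1)) { W[i, i + 1] <- 1; W[i + 1, i] <- 1 }
W <- W / rowSums(W)

# Moran's I: cross-products of deviations, weighted by spatial proximity
z <- x - mean(x)
I <- (n / sum(W)) * sum(W * outer(z, z)) / sum(z^2)
round(I, 2)  # ~0.69: strong positive spatial autocorrelation
```

Values near +1 indicate that similar values sit next to each other, which is what drives the High-High hotspot labels in the map above.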

3.2.3 Distance to Hotspots

Code
# Get centroids of "High-High" cells (rodent hotspots)
hotspots <- fishnet %>%
  filter(moran_class == "High-High") %>%
  st_centroid()

# Calculate distance from each cell to nearest rodent hotspot
if (nrow(hotspots) > 0) {
  fishnet <- fishnet %>%
    mutate(
      dist_to_hotspot = as.numeric(
        st_distance(st_centroid(fishnet), hotspots %>% st_union())
      )
    )
  

  cat("  - Number of hot spot cells:", nrow(hotspots), "\n")
} else {
  fishnet <- fishnet %>%
    mutate(dist_to_hotspot = 0)
  cat("⚠ No significant rodent baiting hot spots found\n")
}
  - Number of hot spot cells: 304 
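The distance feature reduces to a nearest-point computation; a base-R sketch on toy coordinates (the real code uses st_distance() against the union of hotspot centroids):

```r
# Toy cell centroids and hotspot centroids (made-up projected coordinates)
cells    <- cbind(x = c(0, 3, 10), y = c(0, 4, 0))
hotspots <- cbind(x = c(0, 10),    y = c(0, 1))

# For each cell, minimum Euclidean distance to any hotspot
dist_to_hotspot <- apply(cells, 1, function(p) {
  min(sqrt((hotspots[, "x"] - p["x"])^2 + (hotspots[, "y"] - p["y"])^2))
})
dist_to_hotspot  # 0 5 1; a cell that is itself a hotspot gets distance 0
```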

3.3 Building Age

3.3.1 Building Age Feature

Code
# Read building footprint data from a local copy of the Chicago Data Portal export;
# alternatively, read directly from the portal:
# building_sf <- st_read(
#   "https://data.cityofchicago.org/resource/syp8-uezg.geojson",
#   quiet = TRUE
# )

building_sf <- st_read("data/Building_Footprints_20251117.geojson") %>%
  # Clean and compute building age as of 2017
  filter(!is.na(year_built)) %>%
  mutate(year_built = as.numeric(year_built)) %>%
  filter(year_built > 0, year_built <= 2017) %>%
  mutate(bldg_age_2017 = 2017 - year_built) %>%
  st_transform('ESRI:102271')
Reading layer `Building_Footprints_20251117' from data source 
  `/Users/demiyang/Downloads/MUSA/PPA/portfolio/portfolio-setup-demiyang12/assignments/assignment_4/data/Building_Footprints_20251117.geojson' 
  using driver `GeoJSON'
replacing null geometries with empty geometries
Simple feature collection with 820606 features and 46 fields (with 6 geometries empty)
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -87.93981 ymin: 41.6446 xmax: -87.52421 ymax: 42.023
Geodetic CRS:  WGS 84
Code
bldg_by_grid <- st_join(fishnet, building_sf, join = st_intersects) %>%
  st_drop_geometry() %>%
  group_by(uniqueID) %>%
  summarize(
    n_buildings     = sum(!is.na(bldg_age_2017)),
    mean_bldg_age   = ifelse(
      n_buildings == 0, NA_real_,
      mean(bldg_age_2017, na.rm = TRUE)
    ),
    median_bldg_age = ifelse(
      n_buildings == 0, NA_real_,
      median(bldg_age_2017, na.rm = TRUE)
    ),
    .groups = "drop"
  )

fishnet <- fishnet %>%
  left_join(bldg_by_grid, by = "uniqueID")

3.3.2 Visualize Building Age

Code
ggplot() +
  geom_sf(data = fishnet, aes(fill = mean_bldg_age), color = NA) +
  geom_sf(data = chicagoBoundary, fill = NA, color = "white", linewidth = 1) +
  scale_fill_viridis_c(
    name   = "Mean building age (years)",
    option = "magma",
    direction = -1 ,
    na.value = "grey90"
  ) +
  labs(
    title    = "Average Building Age by Grid Cell",
    subtitle = "500m x 500m cells, Chicago, 2017"
  ) +
  theme_default()

  • The map shows large differences in building age across Chicago.
  • Newer buildings appear more often in the central and far north areas, where the colors are lighter.
  • The pattern suggests that building age is not evenly distributed. Older structures cluster in certain neighborhoods, while others have more recent development.
  • Many burglary hotspots appear in areas where buildings are older.
    • These locations, mainly in the northern and southeastern parts of Chicago, show both high burglary counts and high building age.
    • Older buildings may have weaker physical security, such as outdated doors, windows, or locks.
    • Older neighborhoods may also have denser housing, which increases opportunities for burglary.

3.4 Land Use

3.4.1 Land Use Feature

Code
# Read the zoning data

# zoning_sf <- st_read(
#  "https://data.cityofchicago.org/resource/dj47-wfun.geojson",
#  quiet = TRUE
# )

zoning_sf <- st_read("data/Boundaries_-_Zoning_Districts_(current)_20251117.geojson")
Reading layer `Boundaries_-_Zoning_Districts_(current)_20251117' from data source `/Users/demiyang/Downloads/MUSA/PPA/portfolio/portfolio-setup-demiyang12/assignments/assignment_4/data/Boundaries_-_Zoning_Districts_(current)_20251117.geojson' 
  using driver `GeoJSON'
Simple feature collection with 14814 features and 28 fields (with 4 geometries empty)
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -87.94009 ymin: 41.64455 xmax: -87.52414 ymax: 42.02303
Geodetic CRS:  WGS 84
Code
# Extract the letter parts of the first three characters from ZONE_CLASS
zoning_sf <- zoning_sf %>%
  mutate(
    zone_prefix = substr(zone_class, 1, 3),
    zone_prefix = gsub("[^A-Za-z]", "", zone_prefix)
  ) %>%
  filter(zone_prefix != "") %>%
  mutate(
    zone_code = dplyr::recode(
      zone_prefix,
      "RS"  = "1",
      "RT"  = "2",
      "RM"  = "3",
      "B"   = "4",
      "C"   = "5",
      "DC"  = "6",
      "DR"  = "7",
      "DS"  = "8",
      "DX"  = "9",
      "M"   = "10",
      "PMD" = "11",
      "PD"  = "12",
      "T"   = "13",
      "POS" = "14",
      .default = NA_character_
    )
  ) %>%
  filter(!is.na(zone_code)) %>%
  
  st_transform("ESRI:102271")

zoning_sf$zone_code <- as.integer(zoning_sf$zone_code)

# Calculate the mode
mode_first <- function(x) {
  x <- x[!is.na(x)]
  if (length(x) == 0) return(NA)
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

# Aggregate to fishnet: One zoning type for each grid
zoning_by_grid <- st_join(fishnet, zoning_sf, join = st_intersects) %>%
  st_drop_geometry() %>%
  group_by(uniqueID) %>%
  summarize(
    zone_prefix_dom = mode_first(zone_prefix), 
    zone_code_dom   = mode_first(zone_code),
    .groups = "drop"
  )

# join the zoning variable back to fishnet
fishnet <- fishnet %>%
  left_join(zoning_by_grid, by = "uniqueID")
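The mode_first() helper drives this aggregation; it is copied below so it can be checked standalone (the example vectors are illustrative):

```r
# Copy of the helper defined above: most frequent value, ties go to first appearance
mode_first <- function(x) {
  x <- x[!is.na(x)]
  if (length(x) == 0) return(NA)
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

mode_first(c("RS", "B", "RS", NA, "RS"))  # "RS": most frequent value
mode_first(c("RS", "B"))                  # "RS": ties resolve to the first value seen
mode_first(c(NA, NA))                     # NA when nothing remains after dropping NAs
```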

3.4.2 Visualize Land Use

Code
# Create a zoning classification with complete instructions
fishnet <- fishnet %>%
  mutate(
    zone_cat = factor(
      zone_prefix_dom,
      levels = c("RS", "RT", "RM", "B",  "C",  "DC", "DR", "DS", "DX",
                 "M",  "PMD", "PD", "T", "POS"),
      labels = c(
        "RS = Residential Single-Unit District",
        "RT = Residential Two-Flat, Townhouse and Multi-Unit District",
        "RM = Residential Multi-Unit District",
        "B = Business",
        "C = Commercial",
        "DC = Downtown Core District",
        "DR = Downtown Residential District",
        "DS = Downtown Service District",
        "DX = Downtown Mixed-Use District",
        "M = Manufacturing",
        "PMD = Planned Manufacturing Districts",
        "PD = Planned Development",
        "T = Transportation",
        "POS = Parks and Open Space"
      )
    )
  )

#| fig-width: 8
#| fig-height: 6

ggplot() +
  geom_sf(data = fishnet, aes(fill = zone_cat), color = NA) +
  geom_sf(data = chicagoBoundary, fill = NA, color = "white", linewidth = 1) +
  scale_fill_viridis_d(
    name   = "Zoning (dominant class)",
    option = "turbo", 
    na.value = "grey90"
  ) +
  labs(
    title    = "Dominant Zoning Category by Grid Cell",
    subtitle = "500m x 500m cells, Chicago"
  ) +
  theme_default() +
  theme(
    legend.key.height = unit(0.6, "cm"),
    legend.key.width  = unit(0.6, "cm")
  )

  • The map shows the dominant zoning type in each 500m × 500m grid cell in Chicago.
  • The map shows Chicago has a mixed pattern, but residential zones dominate most of the city.
  • Downtown districts (DR, DS, DX) are concentrated in the central east area.
  • Manufacturing (M, PMD) zones cluster in the southeast and central-west areas.
  • Comparing with burglary distribution, burglary hotspots occur mostly in residential zones.
    • This makes sense because burglary is a residential crime, so it appears mainly where people live.
  • Manufacturing zones have low burglary counts.
    • These areas have fewer homes, more industrial facilities, and lower residential density.

3.5 Chapter Summary

  • k-Nearest Neighbor (KNN) Feature for 311 request feature:
    • For every cell, this chapter found its 8 nearest neighboring cells.
    • Then this chapter calculated knn_mean, which is the average rodent request count in those neighbors, and knn_sum, which is the total rodent request count in those neighbors.
  • Distance to Hot Spots Feature (Local Moran’s I):
    • Each grid cell was classified into: High–High (hotspot), Low–Low, High–Low, Low–High, or Not Significant. In practice, most significant clusters are High–High hotspots, while many low-count areas are “Not Significant” instead of Low–Low.
    • Then this chapter took the centroids of High–High cells and measured, for every grid cell, the distance to the nearest hotspot (dist_to_hotspot).
    • This creates a feature that shows how close each place is to intense rodent activity.
  • Building Age Feature:
    • Building data was loaded from Chicago Data Portal, named Building Footprints.
    • This chapter cleaned the data and kept only buildings built before or in 2017, then computed building age in 2017.
    • Then, for each grid cell, this chapter calculated: n_buildings, the number of buildings with a valid age; mean_bldg_age, the average building age; and median_bldg_age, the median building age.
    • These values were joined back to the fishnet, so every cell has information about its building stock and age structure.
  • Land Use Feature:
    • Zoning data was loaded from Chicago Data Portal, named Boundaries - Zoning Districts (current).
    • This chapter aggregated the zoning data into the secondary classification of zoning classes (reference: https://secondcityzoning.org/zones/).
    • For each cell, this chapter identified its function as its dominant zoning type.
    • A map of these categories shows how residential, commercial, downtown, industrial, and park areas are distributed across Chicago.
  • Purpose of Step 3:
    • Step 3 builds key spatial predictor variables on top of the grid: neighborhood intensity (knn_mean), proximity to rodent hotspots (dist_to_hotspot), built environment age (mean_bldg_age), and land use type (zone_code_dom).
    • They are later used in the Poisson and Negative Binomial regression models as main explanatory variables.

Step 4: Count Regression Models

4.1 Poisson regression

4.1.1 Fit Poisson regression

Code
#| label: poisson-reg
#| message: false
#| warning: false
#| results: hide

# Treat the dominant zoning code as categorical (reused in Step 5's CV dataset)
fishnet$zone_code_dom <- as.factor(fishnet$zone_code_dom)

pois_model <- glm(
  countBurglaries ~ mean_bldg_age + zone_prefix_dom + dist_to_hotspot + knn_mean,
  data = fishnet,
  family = "poisson"
)

summary(pois_model)

Call:
glm(formula = countBurglaries ~ mean_bldg_age + zone_prefix_dom + 
    dist_to_hotspot + knn_mean, family = "poisson", data = fishnet)

Coefficients:
                        Estimate    Std. Error z value             Pr(>|z|)    
(Intercept)         -0.250648101   0.072331678  -3.465              0.00053 ***
mean_bldg_age        0.018883092   0.000700108  26.972 < 0.0000000000000002 ***
zone_prefix_domC    -0.082138237   0.045501808  -1.805              0.07105 .  
zone_prefix_domDR  -13.675893085 196.752921310  -0.070              0.94459    
zone_prefix_domDS  -13.019003189 199.994386084  -0.065              0.94810    
zone_prefix_domDX    0.185584582   0.197946901   0.938              0.34848    
zone_prefix_domM    -0.783250148   0.063368112 -12.360 < 0.0000000000000002 ***
zone_prefix_domPD   -0.215354060   0.053979269  -3.990        0.00006619324 ***
zone_prefix_domPMD  -0.811643665   0.138020572  -5.881        0.00000000409 ***
zone_prefix_domPOS  -0.262883964   0.088536272  -2.969              0.00299 ** 
zone_prefix_domRM    0.158877527   0.052655101   3.017              0.00255 ** 
zone_prefix_domRS   -0.167196608   0.031819723  -5.254        0.00000014843 ***
zone_prefix_domRT   -0.027109217   0.040373445  -0.671              0.50193    
zone_prefix_domT    -0.577170309   0.500635901  -1.153              0.24896    
dist_to_hotspot     -0.000010614   0.000005785  -1.835              0.06658 .  
knn_mean             0.002078690   0.000811532   2.561              0.01042 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 8415.3  on 2199  degrees of freedom
Residual deviance: 6650.8  on 2184  degrees of freedom
  (258 observations deleted due to missingness)
AIC: 11849

Number of Fisher Scoring iterations: 10

4.1.2 Check for Overdispersion

Code
#| label: poisson-dispersion-check
#| message: false
#| warning: false

dispersion_rb <- sum(residuals(pois_model, type = "pearson")^2) /
                 pois_model$df.residual

cat("Dispersion parameter for burglary Poisson model:",
    round(dispersion_rb, 2), "\n")
Dispersion parameter for burglary Poisson model: 3.47 
Code
cat("Rule of thumb: values > 1.5 indicate overdispersion.\n")
Rule of thumb: values > 1.5 indicate overdispersion.
Code
if (dispersion_rb > 1.5) {
  cat("⚠ Overdispersion detected for burglary counts.\n")
  cat("  The Poisson model is likely too restrictive.\n")
  cat("  A Negative Binomial model is more appropriate.\n")
} else {
  cat("✓ Dispersion looks acceptable.\n")
  cat("  The Poisson model may be adequate for these counts.\n")
}
⚠ Overdispersion detected for burglary counts.
  The Poisson model is likely too restrictive.
  A Negative Binomial model is more appropriate.
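The same Pearson-dispersion check can be demonstrated on simulated counts: data drawn from a Negative Binomial but fitted with a Poisson GLM should flag overdispersion (the parameters below are arbitrary, chosen only to produce variance well above the mean):

```r
set.seed(5080)
y <- rnbinom(500, mu = 3, size = 1)  # overdispersed: variance = mu + mu^2 = 12, mean = 3
m <- glm(y ~ 1, family = poisson)    # intercept-only Poisson fit

# Pearson dispersion statistic, same formula as the check above
dispersion <- sum(residuals(m, type = "pearson")^2) / m$df.residual
dispersion > 1.5
```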

4.2 Fit Negative Binomial regression

Code
nb_model <- glm.nb(
  countBurglaries ~ mean_bldg_age + zone_prefix_dom + dist_to_hotspot + knn_mean,
  data = fishnet
)

summary(nb_model)

4.3 Compare model fit (AIC)

Code
model_compare <- data.frame(
  Model = c("Poisson", "Negative Binomial"),
  AIC   = c(AIC(pois_model), AIC(nb_model))
)


knitr::kable(model_compare, caption = "Model Comparison: Poisson vs. NB")
Model Comparison: Poisson vs. NB
Model AIC
Poisson 11848.599
Negative Binomial 9836.521
Code
dispersion_row <- data.frame(
  Statistic = "Poisson Dispersion",
  Value     = dispersion_rb
)

knitr::kable(dispersion_row, caption = "Poisson Dispersion Statistic")
Poisson Dispersion Statistic
Statistic Value
Poisson Dispersion 3.474401
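The AIC gap above can be reproduced in miniature on simulated overdispersed counts (MASS ships with base R; the simulation parameters are arbitrary):

```r
library(MASS)  # glm.nb()

set.seed(5080)
y <- rnbinom(1000, mu = 3, size = 1)  # Negative Binomial counts

pois <- glm(y ~ 1, family = poisson)
nb   <- glm.nb(y ~ 1)

AIC(pois) > AIC(nb)  # NB should fit overdispersed counts better
```

The NB model pays one extra parameter (the dispersion, theta) but gains far more in log-likelihood, so its AIC is lower, mirroring the burglary results.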

4.4 Chapter Summary

  • In this chapter, I built statistical models to explain how burglary counts vary with spatial features: building age, zoning, distance to rodent baiting hotspots, and the mean rodent baiting count in neighboring grid cells.
  • This chapter first fitted a Poisson regression with these variables and then checked the dispersion of the residuals. The Poisson dispersion is about 3.47, well above the 1.5 rule of thumb. This means the data are strongly overdispersed (variance much higher than the mean), so the Poisson model is too restrictive.
  • To handle overdispersion, I then fitted a Negative Binomial (NB) regression with the same predictors. This model allows extra variance and is better suited to these burglary counts.
  • The Poisson model has an AIC of about 11849, while the Negative Binomial model has an AIC of about 9837. A lower AIC means a better fit, so the Negative Binomial model performs much better.
  • Overall, Step 4 shows that burglary counts are highly overdispersed, and that the Negative Binomial regression is the appropriate model for capturing the relationship between burglaries and the spatial features created in previous steps.

Step 5: Spatial Cross-Validation

5.1 Leave-One-Group-Out cross-validation on 2017 data

Code
# Calculate which police district the centroid of the grid falls in
fishnet_centroids <- st_centroid(fishnet) %>%
  st_join(policeDistricts, join = st_within) %>%
  st_drop_geometry() %>%
  dplyr::select(uniqueID, District)

# Construct a dataset for cross-validation
fishnet_model <- fishnet %>%
  left_join(fishnet_centroids, by = "uniqueID") %>%
  filter(!is.na(District)) %>%
  mutate(
    District      = as.factor(District),
    zone_code_dom = as.factor(zone_code_dom)
  ) %>%
  filter(
    !is.na(countRodent),
    !is.na(mean_bldg_age),
    !is.na(zone_code_dom),
    !is.na(knn_mean),
    !is.na(dist_to_hotspot)
  )

# LOGO CV
districts  <- unique(fishnet_model$District)
cv_results <- tibble()

for (i in seq_along(districts)) {
  
  test_district <- districts[i]
  
  # Divide the training set/test set
  train_data <- fishnet_model %>% filter(District != test_district)
  test_data  <- fishnet_model %>% filter(District == test_district)
  
  train_data <- train_data %>%
    mutate(zone_code_dom = factor(zone_code_dom))
  
  test_data <- test_data %>%
    mutate(zone_code_dom = factor(zone_code_dom,
           levels = levels(train_data$zone_code_dom)))
  
  # Fit the NB model
  model_cv <- MASS::glm.nb(
    countBurglaries ~ mean_bldg_age + zone_code_dom + dist_to_hotspot + knn_mean,
    data = train_data
  )
  
  # Make predictions on the test police district
  # (errors are computed against countBurglaries, the model's outcome)
  test_data <- test_data %>%
    mutate(
      prediction = predict(model_cv, newdata = ., type = "response"),
      abs_error  = abs(countBurglaries - prediction),
      sq_error   = (countBurglaries - prediction)^2
    )
  
  test_eval <- test_data %>% filter(!is.na(prediction))
  
  mae  <- mean(test_eval$abs_error, na.rm = TRUE)
  rmse <- sqrt(mean(test_eval$sq_error, na.rm = TRUE))
  
  # Save the result
  cv_results <- bind_rows(
    cv_results,
    tibble(
      fold          = i,
      test_district = as.character(test_district),
      n_test        = nrow(test_eval),
      mae           = mae,
      rmse          = rmse
    )
  )
  
  cat("  Fold", i, "/", length(districts),
      "- District", as.character(test_district),
      "- n_test:", nrow(test_eval),
      "- MAE:", round(mae, 2),
      "- RMSE:", round(rmse, 2), "\n")
}
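The loop above can also be written as a small reusable helper. This is a generic sketch (demonstrated on simulated grouped data, not the fishnet) that keeps the same train/test split, model fit, and MAE/RMSE logic:

```r
# Leave-one-group-out CV: returns one row of MAE/RMSE per held-out group
logo_cv <- function(data, group, fit_fun, target) {
  folds <- lapply(unique(data[[group]]), function(g) {
    train <- data[data[[group]] != g, ]
    test  <- data[data[[group]] == g, ]
    pred  <- predict(fit_fun(train), newdata = test, type = "response")
    err   <- test[[target]] - pred
    data.frame(group  = g,
               n_test = nrow(test),
               mae    = mean(abs(err)),
               rmse   = sqrt(mean(err^2)))
  })
  do.call(rbind, folds)
}

# Demo on simulated data with 5 "districts"
set.seed(4)
d   <- data.frame(g = rep(1:5, each = 50), x = runif(250))
d$y <- rpois(250, exp(0.5 + d$x))
res <- logo_cv(d, "g",
               function(tr) glm(y ~ x, family = poisson, data = tr), "y")
res
```

The same helper would accept the NB fit, e.g. `function(tr) MASS::glm.nb(countBurglaries ~ ..., data = tr)`, with `group = "District"`.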

5.2 Calculate and report error metrics (MAE, RMSE)

Code
cv_sorted <- cv_results %>%
  arrange(fold)

# Table 1: local MAE and RMSE
cv_sorted %>%
  knitr::kable(
    digits  = 2,
    caption = "Leave-One-District-Out CV Results (per district)"
  ) %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover"))
Leave-One-District-Out CV Results (per district)
fold test_district n_test mae rmse
1 5 98 6.54 8.97
2 4 181 5.09 7.81
3 22 128 7.86 10.58
4 31 18 1.79 2.17
5 6 81 15.27 21.39
6 8 227 23.52 35.95
7 7 67 25.62 31.88
8 3 59 8.14 10.87
9 2 78 7.88 12.69
10 9 140 18.20 28.11
11 10 84 22.02 32.60
12 1 44 6.66 10.72
13 12 98 21.82 35.64
14 15 37 19.80 25.29
15 11 65 23.88 31.78
16 18 43 29.25 40.30
17 25 113 28.05 36.95
18 14 65 45.98 57.32
19 19 89 43.28 56.53
20 16 179 16.79 23.99
21 17 97 37.12 47.60
22 20 45 41.37 51.85
23 24 54 55.43 63.31
Code
# Table 2: Average MAE and RMSE
cv_summary <- cv_sorted %>%
  summarise(
    mean_MAE  = mean(mae,  na.rm = TRUE),
    mean_RMSE = mean(rmse, na.rm = TRUE),
    n_folds   = dplyr::n()
  )

cv_summary %>%
  knitr::kable(
    digits  = 2,
    caption = "Average MAE and RMSE across all districts (LOGO CV)"
  ) %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover"))
Average MAE and RMSE across all districts (LOGO CV)
mean_MAE mean_RMSE n_folds
22.23 29.75 23
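Note that this summary averages districts equally, so a small district (e.g. 18 test cells) counts as much as a large one. A cell-weighted average is one alternative, sketched here on the first three rows of the per-district table:

```r
# Weight each district's MAE by its number of test cells
cv <- data.frame(n_test = c(98, 181, 128),
                 mae    = c(6.54, 5.09, 7.86))

weighted_mae <- sum(cv$mae * cv$n_test) / sum(cv$n_test)
round(weighted_mae, 2)  # ~6.31 for these three rows
```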

5.3 Visualize error metrics

Code
cv_by_district <- cv_results %>%
  group_by(test_district) %>%
  summarise(
    mae  = mean(mae,  na.rm = TRUE),
    rmse = mean(rmse, na.rm = TRUE),
    .groups = "drop"
  )

# Merge with the police district base map
police_cv <- policeDistricts %>%
  mutate(District = as.character(District)) %>%
  left_join(cv_by_district, by = c("District" = "test_district"))

# RMSE spatial distribution
ggplot(police_cv) +
  geom_sf(aes(fill = rmse)) +
  labs(
    title = "LOGO CV RMSE by Police District",
    fill  = "RMSE"
  ) +
  theme_minimal()

Code
# MAE spatial distribution
ggplot(police_cv) +
  geom_sf(aes(fill = mae)) +
  labs(
    title = "LOGO CV MAE by Police District",
    fill  = "MAE"
  ) +
  theme_minimal()

  • Prediction errors vary strongly across the city.
    • Both the RMSE and MAE maps show that some police districts have much higher prediction errors than others.
    • The lightest areas (central and far-north districts) reach RMSE above 60 and MAE above 50, meaning the model struggles most in these regions.
  • High-error districts overlap with low-burglary areas.
    • The earlier burglary map showed that the northern parts of Chicago have very low burglary counts; these same areas now show the highest prediction errors.
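One way to make that overlap concrete is to scale each district's MAE by its average burglary count, separating "large errors because counts are large" from a genuinely poor fit. The mean counts below are invented placeholders; the real values would come from aggregating countBurglaries by District:

```r
# Relative MAE = MAE / mean count per district
# (mean_count values here are hypothetical, for illustration only)
cv <- data.frame(district   = c("24", "31"),
                 mae        = c(55.43, 1.79),
                 mean_count = c(60, 2))

cv$relative_mae <- cv$mae / cv$mean_count
cv
```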

5.4 Chapter Summary

  • Leave-One-District-Out Cross-Validation
    • I used police districts as spatial groups for cross-validation.
    • For each fold, I took one district as the test set and used all the other districts as the training set, fitted a Negative Binomial model with the same predictors as before (mean_bldg_age, zone_code_dom, dist_to_hotspot, and knn_mean), then predicted burglary counts for the test district and calculated the absolute and squared error for each grid cell.
    • This process was repeated for every police district, so each district was left out exactly once.
  • Error Metrics (MAE and RMSE)
    • MAE (Mean Absolute Error): average size of errors.
    • RMSE (Root Mean Squared Error): penalizes large errors more strongly.
  • Mapping the Errors
    • These maps show that prediction accuracy is not uniform across the city: Some districts (especially in the north) have higher MAE/RMSE, meaning the model struggles there. Districts in the central and southern hotspot areas tend to have lower errors, where burglary patterns are stronger and easier to learn.
  • Purpose of Step 5
    • This step tests the model in a realistic spatial way: it predicts for entire districts that were never seen in training.
    • It reveals where the model is reliable and where it is weak, instead of only reporting a single global fit statistic.
    • The results show that the Negative Binomial model captures burglary patterns better in high-crime, hotspot districts, but has larger uncertainty in low-crime districts.

Step 6: Model Evaluation

6.1 KDE baseline

Code
# 1. Extract XY coordinates of burglary points
bur_xy <- st_coordinates(burglaries)

# 2. Create a bounding window from the Chicago boundary
win_bbox <- spatstat.geom::owin(
  xrange = st_bbox(chicagoBoundary)[c("xmin", "xmax")],
  yrange = st_bbox(chicagoBoundary)[c("ymin", "ymax")]
)

# 3. Create a ppp object (point pattern) for burglaries
bur_ppp <- spatstat.geom::ppp(
  x      = bur_xy[, 1],
  y      = bur_xy[, 2],
  window = win_bbox
)

# 4. Kernel density estimate for the point pattern
bur_kde <- spatstat.explore::density.ppp(
  bur_ppp,
  sigma = 1000   # bandwidth, in the units of the projected CRS
)

# 5. Convert the KDE to a raster and extract values at grid-cell centroids
#    (terra replaces the retired 'raster' package; build the SpatRaster
#    from the KDE's x/y/value grid)
bur_kde_raster <- terra::rast(as.data.frame(bur_kde), type = "xyz")

fishnet_xy <- st_coordinates(st_centroid(fishnet))

fishnet$kde_value <- terra::extract(bur_kde_raster, fishnet_xy)[, 1]

# 6. Replace NA values (centroids outside the KDE window) with 0
fishnet$kde_value[is.na(fishnet$kde_value)] <- 0
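The bandwidth sigma = 1000 is fixed by hand; spatstat also provides data-driven selectors such as `bw.diggle()` and `bw.scott()`. A minimal sketch on a simulated unit-square pattern (so the bandwidth is in unit-square units, not map units):

```r
library(spatstat.geom)
library(spatstat.explore)

# Simulated point pattern standing in for the burglary ppp
set.seed(7)
pp <- rpoispp(100)             # Poisson pattern, ~100 points in [0,1]^2

sigma_cv <- bw.diggle(pp)      # Diggle's cross-validated bandwidth
kde_cv   <- density.ppp(pp, sigma = sigma_cv)
sigma_cv
```

Swapping `sigma_cv` in for the hand-picked value is an easy robustness check on the KDE baseline.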

6.2 Final model + predictions

Code
# Fit the final NB model using all available training cells
final_model <- glm.nb(
  countBurglaries ~ mean_bldg_age + knn_mean + dist_to_hotspot + zone_prefix_dom,
  data = fishnet_model
)

# Add NB predictions back to fishnet (matched by uniqueID)
fishnet <- fishnet %>%
  mutate(
    prediction_nb = predict(
      final_model,
      newdata = fishnet_model,
      type = "response"
    )[match(uniqueID, fishnet_model$uniqueID)]
  )

# Normalize KDE predictions to total burglary count
kde_sum   <- sum(fishnet$kde_value,       na.rm = TRUE)
count_sum <- sum(fishnet$countBurglaries, na.rm = TRUE)

fishnet <- fishnet %>%
  mutate(
    prediction_kde = (kde_value / kde_sum) * count_sum
  )

fishnet <- fishnet %>%
  mutate(
    error_nb      = countBurglaries - prediction_nb,
    abs_error_nb  = abs(error_nb),
    error_kde     = countBurglaries - prediction_kde,
    abs_error_kde = abs(error_kde)
  )
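The KDE "prediction" here is simply the density surface rescaled so its total matches the observed burglary total. A tiny self-contained check of that identity with toy numbers:

```r
# Rescale a nonnegative surface so it sums to the observed total
kde_value <- c(0.2, 0.5, 0.3, 0.0)   # toy density values at four cells
counts    <- c(3, 9, 6, 0)           # toy observed counts

pred_kde <- kde_value / sum(kde_value) * sum(counts)
sum(pred_kde)   # equals sum(counts) = 18, up to floating point
```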

6.3 Compare Actual vs NB vs KDE

Code
#| fig-width: 12
#| fig-height: 4

# Actual counts
p1 <- ggplot() +
  geom_sf(data = fishnet, aes(fill = countBurglaries), color = NA) +
  scale_fill_viridis_c(
    name = "Count", option = "plasma", limits = c(0, 15)
  ) +
  labs(title = "Actual Burglaries (2017)") +
  theme_default()

# Negative Binomial predictions
p2 <- ggplot() +
  geom_sf(data = fishnet, aes(fill = prediction_nb), color = NA) +
  scale_fill_viridis_c(
    name = "Predicted", option = "plasma", limits = c(0, 15)
  ) +
  labs(title = "NB Model Predictions") +
  theme_default()

# KDE baseline predictions
p3 <- ggplot() +
  geom_sf(data = fishnet, aes(fill = prediction_kde), color = NA) +
  scale_fill_viridis_c(
    name = "Predicted", option = "plasma", limits = c(0, 15)
  ) +
  labs(title = "KDE Baseline Predictions") +
  theme_default()

# Combine the three maps
p1 + p2 + p3 +
  plot_annotation(
    title = "Actual vs Predicted Burglaries",
    subtitle = "Comparing NB Model and KDE Baseline"
  )

  • All three maps show the same broad structure: higher burglary levels in the mid-north and mid-south parts of Chicago, and much lower activity in the central core, far north, and far south of the city.
  • The actual burglary map shows sharp, uneven hotspots. The NB model captures the broad hotspot regions and major regional contrasts, but produces smoother predictions because it relies on explanatory variables such as building age, zoning, and hotspot distance.
  • The KDE baseline looks visually similar to the actual map, but this is expected, as KDE simply smooths the observed data without any modeling of underlying urban processes.

6.4 Model performance comparison (MAE / RMSE)

Code
# Calculate MAE and RMSE for NB model and KDE baseline
comparison <- fishnet %>%
  st_drop_geometry() %>%
  filter(!is.na(prediction_nb), !is.na(prediction_kde)) %>%
  summarize(
    model_mae  = mean(abs(countBurglaries - prediction_nb)),
    model_rmse = sqrt(mean((countBurglaries - prediction_nb)^2)),
    kde_mae    = mean(abs(countBurglaries - prediction_kde)),
    kde_rmse   = sqrt(mean((countBurglaries - prediction_kde)^2))
  )

# Reshape and print the table
comparison %>%
  pivot_longer(
    everything(), names_to = "metric", values_to = "value"
  ) %>%
  separate(metric, into = c("approach", "metric"), sep = "_") %>%
  pivot_wider(names_from = metric, values_from = value) %>%
  kable(
    digits = 2,
    caption = "Model Performance: NB vs KDE"
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Model Performance: NB vs KDE
approach mae rmse
model 2.5 3.52
kde 2.1 2.98
  • KDE achieves lower MAE and RMSE because it directly smooths the observed burglary events, essentially reproducing the spatial pattern of the actual data. This makes KDE a strong spatial baseline but not a predictive or explanatory model.
  • In contrast, the Negative Binomial model aims to explain burglary variation using built-environment and zoning characteristics, rather than simply interpolating past events. Therefore, its higher error values are expected and do not indicate model failure. Instead, the NB model provides interpretability and generalization ability that KDE cannot offer.
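MAE and RMSE reward reproducing magnitudes, which favors KDE. If the practical question is which cells rank highest (hotspot targeting), a rank correlation between predicted and actual counts is a complementary metric. A toy sketch with invented vectors:

```r
# Spearman correlation: does the prediction order cells the way reality does?
actual <- c(0, 1, 3, 8, 15)
pred   <- c(1, 1, 2, 6, 12)   # smoothed, under-shooting the peak

cor(actual, pred, method = "spearman")   # high: the ranking is preserved
```

Computed on `prediction_nb` and `prediction_kde` against `countBurglaries`, this would show whether the NB model's smoother surface still prioritizes the right cells.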

6.5 Spatial Distribution of Prediction Errors

Code
# Signed error (NB)
p_nb_err <- ggplot() +
  geom_sf(data = fishnet, aes(fill = error_nb), color = NA) +
  scale_fill_gradient2(
    name     = "Error",
    low      = "#2166ac",
    mid      = "white",
    high     = "#b2182b",
    midpoint = 0,
    limits   = c(-10, 10)
  ) +
  labs(title = "NB Model: Signed Error") +
  theme_default()

# Absolute error (NB)
p_nb_abs <- ggplot() +
  geom_sf(data = fishnet, aes(fill = abs_error_nb), color = NA) +
  scale_fill_viridis_c(
    name   = "Abs. Error",
    option = "magma"
  ) +
  labs(title = "NB Model: Absolute Error") +
  theme_default()

# Signed error (KDE)
p_kde_err <- ggplot() +
  geom_sf(data = fishnet, aes(fill = error_kde), color = NA) +
  scale_fill_gradient2(
    name     = "Error",
    low      = "#2166ac",
    mid      = "white",
    high     = "#b2182b",
    midpoint = 0,
    limits   = c(-10, 10)
  ) +
  labs(title = "KDE Baseline: Signed Error") +
  theme_default()

# Absolute error (KDE)
p_kde_abs <- ggplot() +
  geom_sf(data = fishnet, aes(fill = abs_error_kde), color = NA) +
  scale_fill_viridis_c(
    name   = "Abs. Error",
    option = "magma"
  ) +
  labs(title = "KDE Baseline: Absolute Error") +
  theme_default()

# Combine 4 plots in 2×2 layout
(p_nb_err + p_nb_abs) /
(p_kde_err + p_kde_abs)

  • NB Model: Signed Error (Top-Left)
    • Under-prediction occurs mainly in the major hotspot regions (especially the central-south), while over-prediction appears more in low-crime northern areas.
    • This reflects the NB model's reliance on built-environment variables, which smooth the predictions and do not fully capture local spikes.
  • NB Model: Absolute Error (Top-Right)
    • Absolute errors are largest in districts with very high or very low burglary counts.
    • High-crime areas in the south show moderate, but not extreme, errors.
    • The NB model handles high-crime zones better because built-environment variables explain these areas more consistently.
  • KDE Baseline: Signed Error (Bottom-Left)
    • KDE errors are much noisier and more random.
    • The spatial error pattern lacks structure; KDE simply smooths nearby crime counts.
    • This randomness reflects that KDE is not an explanatory model: it uses only spatial proximity and incorporates no zoning or other urban characteristics.
  • KDE Baseline: Absolute Error (Bottom-Right)
    • Absolute errors are generally smaller than the NB model's, as expected, because KDE is effectively a smoothed version of the real data.
  • The NB model shows more structured errors because it predicts from explanatory variables. This is not a failure; it is evidence that the NB model captures urban mechanisms rather than just smoothing past data.

6.6 Chapter Summary

  • KDE Baseline:
    • This chapter built a KDE (Kernel Density Estimation) surface using the burglary point pattern.
    • KDE acts as a spatial baseline: it uses only the locations of past events, without any built-environment variables.
  • Final NB Model and Predictions:
    • In this chapter, I fitted a final Negative Binomial model on all training cells from fishnet_model, using mean_bldg_age, knn_mean, dist_to_hotspot, and zone_prefix_dom as predictors.
    • I generated NB predictions for each grid cell, stored as prediction_nb, and used the KDE values to create a KDE prediction, prediction_kde, for each cell.
    • For both approaches, I computed the signed and absolute errors relative to the actual burglary counts.
  • Overall Interpretation of Step 6:
    • Step 6 evaluates whether the Negative Binomial model provides reasonable and useful predictions compared to a strong spatial baseline (KDE).
    • Results show that:
      • KDE achieves slightly lower MAE/RMSE because it effectively reproduces the observed pattern by smoothing the data.
      • The NB model offers interpretable, mechanism-based predictions, linking burglary risk to building age, zoning, distance to rodent-baiting hotspots, and the neighboring-cell rodent-baiting feature, even though its errors are somewhat higher.
    • Therefore, KDE is useful as a benchmark for spatial clustering, while the NB model is more valuable for understanding and explaining where and why burglaries concentrate, and for generalizing to new times or scenarios.

Step 7: Summary Statistics and Tables

7.1 Model Summary Table

Code
# Lookup full zoning labels
zoning_labels <- c(
  "RS"  = "Residential Single-Unit District",
  "RT"  = "Residential Two-Flat, Townhouse and Multi-Unit District",
  "RM"  = "Residential Multi-Unit District",
  "B"   = "Business",
  "C"   = "Commercial",
  "DC"  = "Downtown Core District",
  "DR"  = "Downtown Residential District",
  "DS"  = "Downtown Service District",
  "DX"  = "Downtown Mixed-Use District",
  "M"   = "Manufacturing",
  "PMD" = "Planned Manufacturing Districts",
  "PD"  = "Planned Development",
  "T"   = "Transportation",
  "POS" = "Parks and Open Space"
)

model_summary <- broom::tidy(final_model, exponentiate = TRUE) %>%
  mutate(
    term = dplyr::case_when(
      term == "(Intercept)"      ~ "Intercept",
      term == "mean_bldg_age"    ~ "Mean building age",
      term == "knn_mean"         ~ "KNN mean (neighboring rodent baiting requests)",
      term == "dist_to_hotspot"  ~ "Distance to rodent hotspot",

      # Zoning coefficients: zone_prefix_domRS, zone_prefix_domRT
      
      grepl("^zone_prefix_dom", term) ~ paste0(
        "Zoning: ",
        sub("zone_prefix_dom", "", term), 
        " (",
        zoning_labels[sub("zone_prefix_dom", "", term)],
        ")"
      ),

      TRUE ~ term
    ),
    # Round all numeric columns
    dplyr::across(where(is.numeric), ~round(.x, 3))
  )

model_summary %>%
  kable(
    caption   = "Final Negative Binomial Model Coefficients (Exponentiated Rate Ratios)",
    col.names = c("Variable", "Rate Ratio", "Std. Error", "Z", "P-Value")
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover")) %>%
  footnote(
    general = paste(
      "Rate ratios > 1 indicate a positive association with burglary counts,",
      "while rate ratios < 1 indicate a negative association.",
      "Zoning (land use) is a categorical variable; each zoning category is interpreted",
      "relative to the reference category (the omitted baseline level of zone_prefix_dom)."
    )
  )
Final Negative Binomial Model Coefficients (Exponentiated Rate Ratios)
Variable Rate Ratio Std. Error Z P-Value
Intercept 0.714 0.126 -2.667 0.008
Mean building age 1.020 0.001 16.204 0.000
KNN mean (neighboring rodent baiting requests) 1.003 0.002 1.613 0.107
Distance to rodent hotspot 1.000 0.000 -1.137 0.256
Zoning: C (Commercial) 0.896 0.089 -1.239 0.216
Zoning: DR (Downtown Residential District) 0.000 114839.703 0.000 1.000
Zoning: DS (Downtown Service District) 0.000 80757.872 0.000 1.000
Zoning: DX (Downtown Mixed-Use District) 1.391 0.359 0.921 0.357
Zoning: M (Manufacturing) 0.463 0.098 -7.891 0.000
Zoning: PD (Planned Development) 0.800 0.092 -2.432 0.015
Zoning: PMD (Planned Manufacturing Districts) 0.440 0.194 -4.233 0.000
Zoning: POS (Parks and Open Space) 0.771 0.156 -1.661 0.097
Zoning: RM (Residential Multi-Unit District) 1.107 0.112 0.908 0.364
Zoning: RS (Residential Single-Unit District) 0.865 0.058 -2.478 0.013
Zoning: RT (Residential Two-Flat, Townhouse and Multi-Unit District) 0.969 0.081 -0.393 0.694
Zoning: T (Transportation) 0.587 0.755 -0.706 0.480
Note:
Rate ratios > 1 indicate a positive association with burglary counts, while rate ratios < 1 indicate a negative association. Zoning (land use) is a categorical variable; each zoning category is interpreted relative to the reference category (the omitted baseline level of zone_prefix_dom).
  • Mean building age has a strong, significant positive effect: each extra year of average building age in a grid cell increases the expected burglary count by about 2%, holding other variables constant.

  • Manufacturing (M) and Planned Manufacturing Districts (PMD) both have p < 0.001 with rate ratios well below 1, meaning burglary risk in industrial zones is much lower than in the reference zoning category.

  • Planned Development areas (PD) (RR ≈ 0.80, p ≈ 0.015) also show lower burglary risk.

  • Residential Single-Unit districts (RS) (RR ≈ 0.87, p ≈ 0.013) have somewhat lower burglary counts than the reference zoning.

  • Some downtown categories (DR, DS) have extreme rate ratios and huge standard errors, likely because very few grid cells fall in these types, so their coefficients should be interpreted with caution.
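To make the building-age coefficient concrete: a rate ratio of 1.020 per year compounds multiplicatively, so comparing two otherwise-identical cells whose mean building ages differ by 10 years:

```r
rr_per_year <- 1.020            # exponentiated coefficient from the table
rr_10_years <- rr_per_year^10   # expected-count multiplier for +10 years
round(rr_10_years, 2)           # ~1.22, i.e. about 22% more expected burglaries
```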

7.2 Discussion and Limitations

This study shows that the NB model can capture some broad burglary patterns, such as higher risk in areas with older buildings and lower risk in manufacturing zones. However, the model also has clear limitations. Some of the variables I included—such as rodent KNN and distance to rodent hotspots—were chosen somewhat loosely and do not strongly relate to burglary in the results. Because of this, the model cannot explain the sharp hotspots seen in the actual data. Overall, while the NB model offers some useful insights, its performance is restricted by both the choice of variables and the model structure, and future work should include more relevant predictors and more advanced spatial methods.